Skip to content

Record: Pre-quant AdamW TTT + QK-Gain 4.0 — val_bpb 1.1025 (3-seed mean)#1364

Open
stukenov wants to merge 1 commit intoopenai:mainfrom
stukenov:submission/v6-safe-prequant-ttt
Open

Record: Pre-quant AdamW TTT + QK-Gain 4.0 — val_bpb 1.1025 (3-seed mean)#1364
stukenov wants to merge 1 commit intoopenai:mainfrom
stukenov:submission/v6-safe-prequant-ttt

Conversation

@stukenov
Copy link
Copy Markdown

@stukenov stukenov commented Apr 4, 2026

Summary

  • val_bpb: 1.1025 (3-seed mean, std 0.0011)
  • Artifact: <16 MB (max 15,985,137)
  • Training: 600s on 8xH100 SXM | Eval: ~500s

Beats merged SOTA (PR #1019, 1.1147) by 0.0122 BPB = 0.0206 nats (4x the 0.005-nat threshold).

Key Innovation: Pre-quantization AdamW TTT

Standard post-quant SGD TTT fails on GPTQ-quantized models (25 failures, PR #756). We run AdamW TTT on the full-precision EMA model before GPTQ:

  1. Train 600s → EMA model (BPB 1.1463)
  2. AdamW TTT: 6 epochs, freeze first 2 blocks, cosine LR → BPB 1.1189 (-0.027 gain)
  3. Full Hessian GPTQ on adapted model → sliding BPB 1.1025

3-Seed Results

Seed Sliding BPB Artifact
1337 1.1023 15,930,573
42 1.1037 15,985,137
2025 1.1016 15,935,233
Mean 1.1025

Compliance

  • No SLOT, no n-gram cache, no eval-time adaptation
  • Pre-quant TTT adapts model before any eval scoring (Conditions 1-4 satisfied)
  • Full Hessian GPTQ calibrated on training data (inside 600s budget)

Reproduction

SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

PR #1019 (@abaybektursun), PR #1306 (pre-quant TTT concept), PR #1125 (QK-Gain), PR #478 (XSA-all), PR #535 (GPTQ), PR #493 (LeakyReLU²)

Pre-quant TTT (6ep AdamW on EMA before GPTQ) gives -0.027 BPB gain.
3 seeds: 1.1023, 1.1037, 1.1016 (mean 1.1025, std 0.0011).
All artifacts under 16MB. No SLOT, no n-gram.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 5, 2026
 primary path

- CRITICAL: PR openai#1351 (Discriminative TTT, 1.0807) self-closed by author on
  2026-04-05 — pre-quant AdamW TTT ruled as pre-eval adaptation on val data.
  Removed pre-quant TTT from technique table and plan.
- Updated strategy to PR openai#1334 (Depth Recur + Parallel Residuals + MuonEq-R,
  1.0897) as primary architecture target — zero legality flags.
- Logged new PRs: openai#1379 (0.4162, n-gram mixer), openai#1376 (0.7094, SLOT-24 +
  pre-quant TTT), openai#1364 (1.1025, pre-quant TTT at risk), openai#1370 (1.003, GDN).
- SLOT and pre-quant TTT both blocked; discriminative TTT post-quant still legal.
- Updated CLAUDE.md Competition Strategy + Technique Reference + Lessons (v9.0).

https://claude.ai/code/session_01RTLvTuYBp9YMtudwrY8mYM
erichroepke added a commit to erichroepke/parameter-golf that referenced this pull request Apr 6, 2026
…ed mean)

Merges @clarkkev's openai#1394 (SP8192, SDClip, GPTQ embeddings, skip gates) with
@stukenov's openai#1364 (pre-quant AdamW TTT). First combination of these techniques.

3-seed mean: 1.07948 BPB (std=0.00043), artifact 15.12 MB.
Built with Claude Opus 4.6 as AI co-author.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant